Conversation
            LOG_WRN("%s: more info: https://github.com/ggml-org/llama.cpp/issues/16842\n\n", __func__);
        }
    } break;
case PROJECTOR_TYPE_UTUVL:
instead of duplicating the code block, add it to the same case above:
case PROJECTOR_TYPE_QWEN2VL:
case PROJECTOR_TYPE_QWEN25VL:
case PROJECTOR_TYPE_QWEN3VL:
case PROJECTOR_TYPE_UTUVL:
delete this file, reuse qwen2vl.cpp
utu differs from qwen2.5 in several aspects. It's difficult to merge them together
can you list these differences?
- change conv3d to linear
- const bool full_attn = use_window_attn ? (il + 1) % n_wa_pattern == 0 : true; changed to
  const bool full_attn = (il + 1) % 8 == 0 || il == n_layer - 1; (added il == n_layer - 1, where n_layer = 27)
- delete ff_gate_w in build_ffn
- exchange merge and window attention
ok thanks. please also leave this list in the code comment
tools/mtmd/models/utuvl.cpp
Outdated
    // loop over layers
    for (int il = 0; il < n_layer; il++) {
        const auto & layer = model.layers[il];
        const bool full_attn = (il + 1) % 8 == 0 || il == n_layer - 1;
what's the number 8 here? can it be a hparam?
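For reference, the hard-coded rule can be spelled out as the set of layers it selects, which is what a dedicated hparam (or a stored layer list) would carry instead of the literal 8. A minimal sketch, assuming n_layer = 27 as stated in the reply above:

```python
# sketch: the hard-coded rule expressed as an explicit layer set
n_layer = 27  # value from the author's reply above
full_attn_layers = [il for il in range(n_layer) if (il + 1) % 8 == 0 or il == n_layer - 1]
print(full_attn_layers)  # [7, 15, 23, 26] -- the period 8 (or this list) could live in GGUF metadata
```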
tools/mtmd/clip.cpp
Outdated
    if (use_window_attn) {
-       const int attn_window_size = 112;
+       const int attn_window_size = ctx->model.proj_type == PROJECTOR_TYPE_QWEN25VL ? 112 : patch_size * 2 * 8;
extract these numbers into a new hparam instead:
hparams.attn_window_size
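One possible shape for that, sketched from the conversion side (a sketch only; the metadata key name is an assumption taken from the review comment, not an existing key):

```python
# sketch: compute the window size once during conversion and store it as metadata,
# so clip.cpp can read it into hparams.attn_window_size instead of hard-coding 112 / patch_size * 2 * 8
def compute_attn_window_size(patch_size: int, is_qwen25vl: bool) -> int:
    # mirrors the expression under review
    return 112 if is_qwen25vl else patch_size * 2 * 8

# in the model's set_gguf_parameters (hypothetical key name):
# self.gguf_writer.add_uint32("clip.vision.attn_window_size", compute_attn_window_size(patch_size, False))
```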
please also fix any failed checks regarding code formatting
convert_hf_to_gguf.py
Outdated
        if chkhsh == "9d70134b369a70e5735009b6de918f7581b5211f7c074d1f89f753aea8248af1":
            res = "utu-vl"
Do not manually add these, they are generated by convert_hf_to_gguf_update.py. Edit that and run it to get the correct entry here!
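For context, the update script keeps a list of tokenizer sources and regenerates these chkhsh branches from it; the addition would look roughly like the sketch below (the repo URL is a placeholder and the tokenizer type is an assumption):

```python
# sketch of a models-list entry for convert_hf_to_gguf_update.py (values are illustrative)
{"name": "utu-vl", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/<org>/<utu-vl-model>"},
```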
convert_hf_to_gguf.py
Outdated
| if hparams.get("moe_intermediate_size") is not None: | ||
| self.gguf_writer.add_expert_feed_forward_length(hparams["moe_intermediate_size"]) | ||
| else: | ||
| self.gguf_writer.add_expert_feed_forward_length(hparams.get("intermediate_size", 0)) | ||
|
|
||
| if hparams.get("n_routed_experts") is not None: | ||
| self.gguf_writer.add_expert_count(hparams["n_routed_experts"]) | ||
|
|
||
| if hparams.get("n_shared_experts") is not None: | ||
| self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"]) | ||
| else: | ||
| self.gguf_writer.add_expert_shared_count(0) | ||
|
|
||
| if hparams.get("routed_scaling_factor") is not None: | ||
| self.gguf_writer.add_expert_weights_scale(hparams["routed_scaling_factor"]) | ||
| else: | ||
| self.gguf_writer.add_expert_weights_scale(1.0) | ||
|
|
||
| if hparams.get("norm_topk_prob") is not None and hparams["norm_topk_prob"]: | ||
| self.gguf_writer.add_expert_weights_norm(hparams["norm_topk_prob"]) |
Don't fetch the same value in both the condition and the call; use a walrus assignment in the condition, as seen elsewhere.
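A minimal sketch of that walrus form, with the same logic as the quoted lines:

```python
# sketch: fetch norm_topk_prob once and reuse it
if (norm_topk_prob := hparams.get("norm_topk_prob")) is not None and norm_topk_prob:
    self.gguf_writer.add_expert_weights_norm(norm_topk_prob)
```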
        # skip lm_head.weight if tie_word_embeddings is True
        if self.hparams.get("tie_word_embeddings", False):
            # Save token_embd for potential duplication as output if tie_word_embeddings is True
            if name == "model.embed_tokens.weight":
                self._token_embd = data_torch
            if name == "lm_head.weight" or name == "model.lm_head.weight":
                logger.info("Skipping tied output layer 'lm_head.weight' - will duplicate from token_embd.weight")
                return []
Don't do this on conversion, do it on model load like every other model.
    def add_vision_n_wa_pattern(self, value: int) -> None:
        self.add_uint32(Keys.ClipVision.N_WA_PATTERN, value)

    def add_vision_wa_layers(self, layers: Sequence[int]) -> None:
        self.add_array(Keys.ClipVision.WA_LAYERS, layers)
revert this change, add a dedicated metadata key instead
gguf-py/gguf/constants.py
Outdated
    USE_GELU     = "clip.use_gelu"
    USE_SILU     = "clip.use_silu"
    N_WA_PATTERN = "clip.vision.n_wa_pattern"  # used by qwen2.5vl
    WA_LAYERS    = "clip.vision.wa_layers"     # used by qwen2.5vl and utuvl
revert this change, add a dedicated metadata key instead
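i.e. leave the qwen2.5vl key and its comment untouched and introduce a separate key for the new model. A sketch, with illustrative (not agreed-upon) names:

```python
# constants.py sketch: keep the existing qwen2.5vl key as-is, add a dedicated one for utuvl
N_WA_PATTERN = "clip.vision.n_wa_pattern"        # existing, used by qwen2.5vl (unchanged)
UTUVL_WA_LAYERS = "clip.vision.utuvl.wa_layers"  # hypothetical dedicated key

# gguf_writer.py sketch: a dedicated writer method rather than repurposing the qwen one
# def add_vision_utuvl_wa_layers(self, layers: Sequence[int]) -> None:
#     self.add_array(Keys.ClipVision.UTUVL_WA_LAYERS, layers)
```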
convert_hf_to_gguf.py
Outdated
            # save window attention layers (full attention block indexes)
            fullatt_block_indexes = hparams.get("fullatt_block_indexes")
            assert fullatt_block_indexes is not None, "fullatt_block_indexes is required for qwen2_5_vl"
            n_wa_pattern = fullatt_block_indexes[0] + 1
            # validate n_wa_pattern
            for i in range(1, len(fullatt_block_indexes)):
                if fullatt_block_indexes[i] - fullatt_block_indexes[i - 1] != n_wa_pattern:
                    raise ValueError(f"Invalid fullatt_block_indexes: {fullatt_block_indexes}")
            self.gguf_writer.add_vision_n_wa_pattern(n_wa_pattern)
            self.gguf_writer.add_vision_wa_layers(fullatt_block_indexes)
revert any changes to qwen code
tools/mtmd/clip.cpp
Outdated
            hparams.set_warmup_n_tokens(46*46); // avoid OOM on warmup
            const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
            if (hparams.image_min_pixels < warn_min_pixels) {
                LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
is this warning true, or just blindly copy-pasted?
I just updated the code.
tools/mtmd/clip.cpp
Outdated
            return std::max(align_size, aligned);
        };

        // Binary search with 0.02 step size
is this a binary or linear search? where is the binary part?
sorry, I just fixed the comment
and changes to minja must be done in the upstream project, not here
| for (auto & layer : wa_layers_vec) { | ||
| hparams.wa_layers.insert(layer); | ||
| } | ||
| hparams.set_limit_image_tokens(1, 62500); |
are you sure your model actually supports 62500 tokens per image? how did you calculate it?
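For scale, a back-of-the-envelope check of what that limit implies (the patch and merge sizes below are assumptions for illustration, not values taken from the model config):

```python
# sketch: rough pixel budget implied by a 62500-image-token limit
patch_size, n_merge = 14, 2                       # assumed values, for illustration only
max_tokens = 62500
max_pixels = max_tokens * (patch_size * n_merge) ** 2
print(max_pixels)                                 # 49,000,000 px, roughly a 7000 x 7000 image
```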
            const int warn_min_pixels = 1 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
            if (hparams.image_min_pixels < warn_min_pixels) {
                LOG_WRN("%s: Youtu-VL models require at minimum 1 image tokens to function correctly on grounding tasks\n", __func__);
I don't see the point of this check. it's redundant. delete it
@ngxson The project associated with this pull request was previously deleted. I pushed the local offline project to the master branch, but I can't reopen the pull request. You can check the differences in the new commits.
Support for the large youtu-vl model, which will be open-sourced soon.